A Fault Model for Upgrades in Distributed Systems (CMU-PDL-08-115)

Authors

  • Tudor Dumitraş
  • Soila Kavulya
  • Priya Narasimhan
Abstract

Recent studies, and a large body of anecdotal evidence, suggest that upgrades are unreliable and often end in failure, causing downtime and data loss. While this is sometimes due to software defects in the new version, most upgrade failures are the result of faults in the upgrade procedure, such as broken dependencies. In this paper, we present data on upgrade failures from three independent sources (a user study, a survey and a field study) and, through statistical cluster analysis, we construct a novel fault model for upgrades in distributed systems. We identify four distinct types of faults: (1) simple configuration errors (e.g., typos); (2) semantic configuration errors (e.g., misunderstood effects of parameters); (3) broken environmental dependencies (e.g., incorrect libraries, port conflicts); and (4) complex procedural errors. We estimate that, on average, Type 1 faults occur in 15.2% of upgrades, and Type 4 faults occur in 16.8% of upgrades.

Tudor Dumitraş, Soila Kavulya and Priya Narasimhan, Carnegie Mellon University, Pittsburgh, PA 15217

Similar resources

No Downtime for Data Conversions: Rethinking Hot Upgrades (CMU-PDL-09-106)

Unavailability in enterprise systems is usually the result of planned events, such as upgrades, rather than failures. Major system upgrades entail complex data conversions that are difficult to perform on the fly, in the face of live workloads. Minimizing the downtime imposed by such conversions is a time-intensive and error-prone manual process. We present Imago, a system that aims to simplify...

Ganesha: Black-Box Fault Diagnosis for MapReduce Systems (CMU-PDL-08-112)

Ganesha aims to diagnose faults transparently in MapReduce systems, by analyzing OS-level metrics alone. Ganesha’s approach is based on peer-symmetry under fault-free conditions, and can diagnose faults that manifest asymmetrically at nodes within a MapReduce system. While our training is performed on smaller Hadoop clusters and for specific workloads, our approach allows us to diagnose faults ...

Why Do Upgrades Fail and What Can We Do about It? Toward Dependable, Online Upgrades in Enterprise Systems

Enterprise-system upgrades are unreliable and often produce downtime or data-loss. Errors in the upgrade procedure, such as broken dependencies, constitute the leading cause of upgrade failures. We propose a novel upgrade-centric fault model, based on data from three independent sources, which focuses on the impact of procedural errors rather than software defects. We show that current approache...

Towards Self-Predicting Systems: What if You Could Ask “What-if”? (CMU-PDL-05-101)

Today, management and tuning questions are approached using if...then... rules of thumb. This reactive approach requires expertise regarding system behavior, making it difficult to deal with unforeseen uses of a system’s resources and leading to system unpredictability and large system management overheads. We propose a What...if... approach that allows interactive exploration of the effects...

Publication date: 2009